Behavioral Risk Factor Surveillance System Analysis

Data Exploration¶

The first step in the analysis is fundamental data exploration: importing libraries and reviewing the shape, structure, and descriptive statistics of the dataset. A thorough check for missing values lays the groundwork for the in-depth analysis and machine learning applications that follow.

Importing necessary libraries

In [1]:
# For computations using data frames and mathematics
import numpy as np
import pandas as pd

# For Visualisation 
import seaborn as sns
from scipy import stats
import plotly.express as px
import matplotlib.pyplot as plt

# For models, model selection, and evaluation metrics
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
In [2]:
df = pd.read_csv('CVD_cleaned.csv')
df.head(5)
Out[2]:
General_Health Checkup Exercise Heart_Disease Skin_Cancer Other_Cancer Depression Diabetes Arthritis Sex Age_Category Height_(cm) Weight_(kg) BMI Smoking_History Alcohol_Consumption Fruit_Consumption Green_Vegetables_Consumption FriedPotato_Consumption
0 Poor Within the past 2 years No No No No No No Yes Female 70-74 150.0 32.66 14.54 Yes 0.0 30.0 16.0 12.0
1 Very Good Within the past year No Yes No No No Yes No Female 70-74 165.0 77.11 28.29 No 0.0 30.0 0.0 4.0
2 Very Good Within the past year Yes No No No No Yes No Female 60-64 163.0 88.45 33.47 No 4.0 12.0 3.0 16.0
3 Poor Within the past year Yes Yes No No No Yes No Male 75-79 180.0 93.44 28.73 No 0.0 30.0 30.0 8.0
4 Good Within the past year No No No No No No No Male 80+ 191.0 88.45 24.37 Yes 0.0 8.0 4.0 0.0

To start the research, an initial review of the dataset's structural features provides a complete picture. The dataset is large, with 308,854 rows and 19 distinct features that collectively offer an extensive amount of health-related data.

In [3]:
df.shape
Out[3]:
(308854, 19)

A thorough analysis of the dataset is carried out to gain a more detailed understanding. This includes extracting essential details such as null counts, the dataset's shape, column names, and the data types it contains. This summary is the first step toward understanding the structure and basic characteristics of the dataset.

In [4]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 308854 entries, 0 to 308853
Data columns (total 19 columns):
 #   Column                        Non-Null Count   Dtype  
---  ------                        --------------   -----  
 0   General_Health                308854 non-null  object 
 1   Checkup                       308854 non-null  object 
 2   Exercise                      308854 non-null  object 
 3   Heart_Disease                 308854 non-null  object 
 4   Skin_Cancer                   308854 non-null  object 
 5   Other_Cancer                  308854 non-null  object 
 6   Depression                    308854 non-null  object 
 7   Diabetes                      308854 non-null  object 
 8   Arthritis                     308854 non-null  object 
 9   Sex                           308854 non-null  object 
 10  Age_Category                  308854 non-null  object 
 11  Height_(cm)                   308854 non-null  float64
 12  Weight_(kg)                   308854 non-null  float64
 13  BMI                           308854 non-null  float64
 14  Smoking_History               308854 non-null  object 
 15  Alcohol_Consumption           308854 non-null  float64
 16  Fruit_Consumption             308854 non-null  float64
 17  Green_Vegetables_Consumption  308854 non-null  float64
 18  FriedPotato_Consumption       308854 non-null  float64
dtypes: float64(7), object(12)
memory usage: 44.8+ MB

To go deeper into the dataset, its descriptive statistics are examined: the count, mean, standard deviation, minimum, maximum, and quartiles of each numeric column, all important for statistical analysis. Extracting these summary statistics gives a clearer picture of the dataset's numerical features.

In [5]:
df.describe()
Out[5]:
Height_(cm) Weight_(kg) BMI Alcohol_Consumption Fruit_Consumption Green_Vegetables_Consumption FriedPotato_Consumption
count 308854.000000 308854.000000 308854.000000 308854.000000 308854.000000 308854.000000 308854.000000
mean 170.615249 83.588655 28.626211 5.096366 29.835200 15.110441 6.296616
std 10.658026 21.343210 6.522323 8.199763 24.875735 14.926238 8.582954
min 91.000000 24.950000 12.020000 0.000000 0.000000 0.000000 0.000000
25% 163.000000 68.040000 24.210000 0.000000 12.000000 4.000000 2.000000
50% 170.000000 81.650000 27.440000 1.000000 30.000000 12.000000 4.000000
75% 178.000000 95.250000 31.850000 6.000000 30.000000 20.000000 8.000000
max 241.000000 293.020000 99.330000 30.000000 120.000000 128.000000 128.000000

To ensure the dataset behaves reliably, a thorough evaluation of each column's data type is carried out. This is a crucial step to assure consistency and coherence across the dataset, providing a strong basis for further analytical work.

In [6]:
df.dtypes
Out[6]:
General_Health                   object
Checkup                          object
Exercise                         object
Heart_Disease                    object
Skin_Cancer                      object
Other_Cancer                     object
Depression                       object
Diabetes                         object
Arthritis                        object
Sex                              object
Age_Category                     object
Height_(cm)                     float64
Weight_(kg)                     float64
BMI                             float64
Smoking_History                  object
Alcohol_Consumption             float64
Fruit_Consumption               float64
Green_Vegetables_Consumption    float64
FriedPotato_Consumption         float64
dtype: object

As part of data quality assurance, the dataset is examined for null values. The check finds no missing values in any column, a satisfying result: the cleanliness of the dataset gives confidence in the accuracy of the subsequent analyses.

In [7]:
df.isnull().sum()
Out[7]:
General_Health                  0
Checkup                         0
Exercise                        0
Heart_Disease                   0
Skin_Cancer                     0
Other_Cancer                    0
Depression                      0
Diabetes                        0
Arthritis                       0
Sex                             0
Age_Category                    0
Height_(cm)                     0
Weight_(kg)                     0
BMI                             0
Smoking_History                 0
Alcohol_Consumption             0
Fruit_Consumption               0
Green_Vegetables_Consumption    0
FriedPotato_Consumption         0
dtype: int64

Data Preprocessing¶

Column data types are converted to improve the dataset's structural alignment: categorical columns to pandas string dtype and whole-valued numeric columns to int64. This conversion ensures consistency across the dataset, creating an organized framework for smooth data preprocessing.

In [8]:
df = df.astype({
    'General_Health': 'string',
    'Exercise': 'string',
    'Heart_Disease': 'string',
    'Skin_Cancer': 'string',
    'Other_Cancer': 'string',
    'Depression': 'string',
    'Diabetes': 'string',
    'Arthritis': 'string',
    'Sex': 'string',
    'Smoking_History': 'string',
    'Checkup': 'object',
    'Age_Category': 'string',
    'Height_(cm)': 'int64',
    'Weight_(kg)': 'float64',
    'BMI': 'float64',
    'Alcohol_Consumption': 'int64',
    'Fruit_Consumption': 'int64',
    'Green_Vegetables_Consumption': 'int64',
    'FriedPotato_Consumption': 'int64'
})

print(df.dtypes)
General_Health                   string
Checkup                          object
Exercise                         string
Heart_Disease                    string
Skin_Cancer                      string
Other_Cancer                     string
Depression                       string
Diabetes                         string
Arthritis                        string
Sex                              string
Age_Category                     string
Height_(cm)                       int64
Weight_(kg)                     float64
BMI                             float64
Smoking_History                  string
Alcohol_Consumption               int64
Fruit_Consumption                 int64
Green_Vegetables_Consumption      int64
FriedPotato_Consumption           int64
dtype: object

Outlier Removal and Label Encoding¶

To improve data integrity and support robust analysis, a helper function 'remove_outliers' is introduced. Given a DataFrame (df) and an optional z-score threshold (z_threshold, default 3), it computes z-scores for the numeric columns and drops every row in which any z-score exceeds the threshold, reducing the influence of outliers. After the filtered DataFrame is produced, the categorical columns are label-encoded with scikit-learn's LabelEncoder, converting string categories to integer codes so they can be used in correlation analysis and model training. Together, these steps prepare the data for the analytical work that follows.

In [9]:
def remove_outliers(df, z_threshold=3):
    z_scores = np.abs(stats.zscore(df.select_dtypes(include=['int64', 'float64'])))
    df_no_outliers = df[(z_scores < z_threshold).all(axis=1)]
    return df_no_outliers

df = remove_outliers(df)

# Encoding Categorical Variables
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
categorical_columns = df.select_dtypes(include=['string', 'object']).columns
for column in categorical_columns:
    df[column] = label_encoder.fit_transform(df[column])

Feature Engineering¶

A utility function, 'extract_age', converts an age-category label into a numeric age. Categories ending in '+' (such as '80+') return a fixed value of 80; ranges containing '-' (such as '18-24') return the midpoint of the range; and bare values (such as '65') are converted directly to an integer. Note that because 'Age_Category' was label-encoded in the previous step, the cell below copies the encoded values into a new 'Age' column rather than applying the function to the original string labels.

In [10]:
# Function to extract the age
def extract_age(age_category):
    if '+' in age_category:
        # Handling '80+' as a specific case
        return 80
    else:
        # Splitting the range and calculating the average
        age_range = age_category.split('-')
        if '-' in age_category:
            return (int(age_range[0]) + int(age_range[1])) / 2
        else:
            return int(age_category)

# Applying the function to create the 'Age' column
df['Age'] = df['Age_Category']
df
Out[10]:
General_Health Checkup Exercise Heart_Disease Skin_Cancer Other_Cancer Depression Diabetes Arthritis Sex Age_Category Height_(cm) Weight_(kg) BMI Smoking_History Alcohol_Consumption Fruit_Consumption Green_Vegetables_Consumption FriedPotato_Consumption Age
0 3 2 0 0 0 0 0 0 1 0 10 150 32.66 14.54 1 0 30 16 12 10
1 4 4 0 1 0 0 0 2 0 0 10 165 77.11 28.29 0 0 30 0 4 10
2 4 4 1 0 0 0 0 2 0 0 8 163 88.45 33.47 0 4 12 3 16 8
3 3 4 1 1 0 0 0 2 0 1 11 180 93.44 28.73 0 0 30 30 8 11
4 2 4 0 0 0 0 0 0 0 1 12 191 88.45 24.37 1 0 8 4 0 12
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
308848 2 3 1 0 0 0 0 0 0 1 7 168 58.97 20.98 0 0 16 12 0 7
308849 4 4 1 0 0 0 0 0 0 1 1 168 81.65 29.05 0 4 30 8 0 1
308851 4 0 1 0 0 0 1 3 0 0 2 157 61.23 24.69 1 4 40 8 4 2
308852 4 4 1 0 0 0 0 0 0 1 9 183 79.38 23.73 0 3 30 12 0 9
308853 0 4 1 0 0 0 0 0 0 0 5 160 81.19 31.71 0 1 5 12 1 5

276089 rows × 20 columns
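As written, the cell above copies the encoded 'Age_Category' values rather than running `extract_age`. A minimal sketch of how the function behaves on the original string labels (before label encoding), using a few hypothetical sample values:

```python
import pandas as pd

def extract_age(age_category):
    # '80+' is an open-ended bracket; return its lower bound
    if '+' in age_category:
        return 80
    # 'a-b' ranges map to their midpoint
    if '-' in age_category:
        lo, hi = age_category.split('-')
        return (int(lo) + int(hi)) / 2
    # a bare value such as '65' converts directly
    return int(age_category)

ages = pd.Series(['18-24', '70-74', '80+', '65'])
print(ages.apply(extract_age).tolist())  # [21.0, 72.0, 80, 65]
```

Applying the function before label encoding would preserve the numeric meaning of the age brackets instead of their arbitrary encoded ordinals.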

As a further preprocessing step, the BMI column is recomputed from the 'Weight_(kg)' and 'Height_(cm)' columns. Height in centimetres is converted to metres by dividing by 100, so ((df['Height_(cm)'] / 100) ** 2) gives the squared height in metres. Dividing the weight in kilograms by this squared height yields the BMI for each row, providing a precise and uniform body-mass metric across the dataset.

In [11]:
# Simplified BMI calculation
df['BMI'] = df['Weight_(kg)'] / ((df['Height_(cm)'] / 100) ** 2)
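A quick sanity check of the formula against one of the rows shown earlier (165 cm, 77.11 kg, stored BMI 28.29; the small discrepancy presumably reflects rounding in the source column):

```python
# BMI = weight (kg) / height (m) squared
weight_kg, height_cm = 77.11, 165.0
bmi = weight_kg / ((height_cm / 100) ** 2)
print(round(bmi, 2))  # 28.32, close to the 28.29 stored in the source column
```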

Visualisation¶

Distribution of Heart Disease¶

To show the distribution of heart disease in the dataset (df), a count plot is produced with the Plotly Express library. A value of 1 indicates the presence of heart disease and 0 its absence. The resulting figure, titled "Distribution of Heart Disease", shows the frequency of each 'Heart_Disease' class on the x-axis, giving a clear and informative summary of the condition's prevalence within the dataset.

Output:

Examining the output reveals a low incidence of heart disease: only a small percentage of the dataset has this medical condition.

In [12]:
fig = px.histogram(
    df,
    x='Heart_Disease',
    color='Heart_Disease',
    labels={'Heart_Disease': 'Heart Disease'},
    title='Distribution of Heart Disease',
)

fig.update_layout(
    xaxis_title='Heart Disease',
    yaxis_title='Count',
)

fig.show()

Correlation Analysis¶

Correlation analyses examining the 'Heart_Disease' column and other health-related factors in the dataset show complex relationships. Important details about the dataset may be found in the correlation matrix, which uses correlation coefficients ranging from -1 to 1 to express these associations. Key findings include:

1. Heart Disease and Diabetes (0.168): A moderate positive correlation indicates a possible link between diabetes and heart disease, suggesting an increased risk of heart-related issues among people with diabetes.

2. Heart Disease and Age (0.233): This positive relationship highlights the fact that the risk of developing heart disease increases with age, revealing an age-dependent vulnerability.

3. Skin cancer and other cancers (0.150): The positive association suggests that there may be common risk factors or underlying processes that contribute to the occurrence of several types of cancer in addition to skin cancer.

4. Exercise and Diabetes (-0.135): This shows that regular physical activity may have a preventive impact on the development of diabetes, which is consistent with accepted health principles.

5. BMI and Weight (0.844): The strong positive correlation between BMI and weight highlights the relationship that exists between the two variables, indicating the role that each plays in defining an individual's entire body composition.

6. Exercise and Alcohol Consumption (0.118): The weak positive association between exercise and alcohol consumption points to a concurrent pattern: people who exercise also tend to report some alcohol consumption, though correlation alone does not establish a causal link.

These findings provide important insights for focused interventions and public health policies in addition to adding to a deeper knowledge of health connections.
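The coefficients discussed above can be pulled from the matrix programmatically rather than read off by eye. A sketch on a small synthetic frame (column names chosen to mirror the notebook's; the data here is made up for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
age = rng.integers(1, 13, n)
# Make heart disease loosely dependent on age so a positive correlation appears
heart = (age + rng.normal(0, 4, n) > 10).astype(int)
toy = pd.DataFrame({'Age_Category': age,
                    'Heart_Disease': heart,
                    'Exercise': rng.integers(0, 2, n)})

# Correlations with the target, strongest first
corr_with_target = (toy.corr()['Heart_Disease']
                    .drop('Heart_Disease')
                    .sort_values(ascending=False))
print(corr_with_target)
```

On the real DataFrame the same pattern, `df.corr()['Heart_Disease'].sort_values()`, ranks every feature by its association with heart disease.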

In [13]:
df.corr()
Out[13]:
General_Health Checkup Exercise Heart_Disease Skin_Cancer Other_Cancer Depression Diabetes Arthritis Sex Age_Category Height_(cm) Weight_(kg) BMI Smoking_History Alcohol_Consumption Fruit_Consumption Green_Vegetables_Consumption FriedPotato_Consumption Age
General_Health 1.000000 0.026057 0.037862 -0.022939 0.018898 0.002223 0.001567 -0.026001 0.011273 -0.015768 0.029396 0.000088 0.023168 0.024794 0.001689 0.027691 -0.004715 -0.009803 0.002836 0.029396
Checkup 0.026057 1.000000 -0.031425 0.085412 0.079136 0.086693 0.034371 0.128659 0.151135 -0.099826 0.225873 -0.089177 0.011150 0.065015 -0.006792 -0.045097 0.040962 0.049745 -0.066167 0.225873
Exercise 0.037862 -0.031425 1.000000 -0.097616 -0.008178 -0.056206 -0.081230 -0.135287 -0.124046 0.063906 -0.128877 0.097187 -0.067275 -0.140239 -0.094206 0.118789 0.137490 0.146838 -0.033929 -0.128877
Heart_Disease -0.022939 0.085412 -0.097616 1.000000 0.093381 0.093845 0.030819 0.168441 0.155523 0.071980 0.233017 0.015581 0.048619 0.046389 0.109822 -0.051777 -0.019420 -0.020643 -0.012972 0.233017
Skin_Cancer 0.018898 0.079136 -0.008178 0.093381 1.000000 0.150536 -0.011644 0.038111 0.137625 0.007274 0.270318 0.005132 -0.020923 -0.028584 0.031116 0.019914 0.028005 0.030356 -0.042705 0.270318
Other_Cancer 0.002223 0.086693 -0.056206 0.093845 0.150536 1.000000 0.014678 0.066042 0.129958 -0.042369 0.235955 -0.044302 -0.019371 0.005226 0.053710 -0.023363 0.010677 0.004640 -0.040664 0.235955
Depression 0.001567 0.034371 -0.081230 0.030819 -0.011644 0.014678 1.000000 0.048162 0.120713 -0.141150 -0.100843 -0.093349 0.028763 0.095232 0.102184 -0.024138 -0.040335 -0.062374 0.019190 -0.100843
Diabetes -0.026001 0.128659 -0.135287 0.168441 0.038111 0.066042 0.048162 1.000000 0.134115 -0.011600 0.202778 -0.042289 0.147966 0.199774 0.058165 -0.116838 -0.017819 -0.033305 -0.009640 0.202778
Arthritis 0.011273 0.151135 -0.124046 0.155523 0.137625 0.129958 0.120713 0.134115 1.000000 -0.104465 0.375686 -0.103089 0.065784 0.139030 0.124312 -0.049986 0.002429 -0.007205 -0.059778 0.375686
Sex -0.015768 -0.099826 0.063906 0.071980 0.007274 -0.042369 -0.141150 -0.011600 -0.104465 1.000000 -0.065572 0.705696 0.384814 0.015511 0.068577 0.116365 -0.092029 -0.072312 0.157993 -0.065572
Age_Category 0.029396 0.225873 -0.128877 0.233017 0.270318 0.235955 -0.100843 0.202778 0.375686 -0.065572 1.000000 -0.125048 -0.045051 0.018961 0.133802 -0.044179 0.053164 0.077506 -0.172412 1.000000
Height_(cm) 0.000088 -0.089177 0.097187 0.015581 0.005132 -0.044302 -0.093349 -0.042289 -0.103089 0.705696 -0.125048 1.000000 0.509410 -0.018590 0.047971 0.128767 -0.044161 -0.023955 0.135174 -0.125048
Weight_(kg) 0.023168 0.011150 -0.067275 0.048619 -0.020923 -0.019371 0.028763 0.147966 0.065784 0.384814 -0.045051 0.509410 1.000000 0.844531 0.052068 -0.013201 -0.088073 -0.068928 0.118319 -0.045051
BMI 0.024794 0.065015 -0.140239 0.046389 -0.028584 0.005226 0.095232 0.199774 0.139030 0.015511 0.018961 -0.018590 0.844531 1.000000 0.031099 -0.094286 -0.075401 -0.067751 0.055005 0.018961
Smoking_History 0.001689 -0.006792 -0.094206 0.109822 0.031116 0.053710 0.102184 0.058165 0.124312 0.068577 0.133802 0.047971 0.052068 0.031099 1.000000 0.060649 -0.092310 -0.028583 0.042474 0.133802
Alcohol_Consumption 0.027691 -0.045097 0.118789 -0.051777 0.019914 -0.023363 -0.024138 -0.116838 -0.049986 0.116365 -0.044179 0.128767 -0.013201 -0.094286 0.060649 1.000000 0.004865 0.087926 0.025783 -0.044179
Fruit_Consumption -0.004715 0.040962 0.137490 -0.019420 0.028005 0.010677 -0.040335 -0.017819 0.002429 -0.092029 0.053164 -0.044161 -0.088073 -0.075401 -0.092310 0.004865 1.000000 0.249732 -0.099847 0.053164
Green_Vegetables_Consumption -0.009803 0.049745 0.146838 -0.020643 0.030356 0.004640 -0.062374 -0.033305 -0.007205 -0.072312 0.077506 -0.023955 -0.068928 -0.067751 -0.028583 0.087926 0.249732 1.000000 -0.068829 0.077506
FriedPotato_Consumption 0.002836 -0.066167 -0.033929 -0.012972 -0.042705 -0.040664 0.019190 -0.009640 -0.059778 0.157993 -0.172412 0.135174 0.118319 0.055005 0.042474 0.025783 -0.099847 -0.068829 1.000000 -0.172412
Age 0.029396 0.225873 -0.128877 0.233017 0.270318 0.235955 -0.100843 0.202778 0.375686 -0.065572 1.000000 -0.125048 -0.045051 0.018961 0.133802 -0.044179 0.053164 0.077506 -0.172412 1.000000

To uncover the many relationships within the dataset, a correlation heatmap is produced with the seaborn library. The heatmap visualizes the correlation matrix, with each cell's colour and annotated coefficient indicating the direction and strength of the relationship between a pair of columns. Generated with sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm"), the figure is an effective tool for spotting trends, identifying dependencies, and highlighting directions for further investigation, offering a visually intuitive picture of the connections among the health-related variables.

In [14]:
correlation_matrix = df.corr()
plt.figure(figsize=(16, 14))
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm")
plt.title("Correlation Heatmap")
plt.show()

People with Heart Disease and their BMI Distribution¶

The analysis then focuses on the members of the dataset diagnosed with heart disease, selected with the filter 'Heart_Disease == 1'. For this subset, a histogram of Body Mass Index (BMI) is created with the Plotly Express module. The resulting plot, titled "BMI Distribution for People with Heart Disease", shows BMI on the x-axis and the number of individuals on the y-axis; the 'bargap' option adjusts the spacing between the histogram bars. This visualization offers a clear look at the BMI distribution patterns among people diagnosed with heart disease.

Output:

As we can see, the largest group of people with heart disease (3,240 individuals) have BMI levels in the 28 to 29.999 bin, which falls in the overweight category, indicating that people with heart disease in this dataset tend to be overweight.

For reference, a BMI of 18.5 to 24.9 is considered normal, 25.0 to 29.9 is considered overweight, below 18.5 underweight, and 30.0 or above obese.
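The thresholds quoted above translate directly into a small helper (the function name `bmi_category` is an illustration, not part of the notebook):

```python
def bmi_category(bmi):
    # Standard adult BMI bands
    if bmi < 18.5:
        return 'Underweight'
    if bmi < 25.0:
        return 'Normal'
    if bmi < 30.0:
        return 'Overweight'
    return 'Obese'

print([bmi_category(b) for b in (17.0, 22.0, 28.5, 31.0)])
# ['Underweight', 'Normal', 'Overweight', 'Obese']
```

Applied with `df['BMI'].apply(bmi_category)`, this would label every row for group-level summaries.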

In [15]:
# Selecting only people with heart disease
heart_disease_df = df[df['Heart_Disease'] == 1]

fig = px.histogram(
    heart_disease_df,
    x='BMI',
    nbins=30,
    labels={'BMI': 'Body Mass Index'},
    title="BMI Distribution for People with Heart Disease",
)

fig.update_layout(
    xaxis_title="BMI",
    yaxis_title="Count",
    bargap=0.1,  # Adjust the gap between bars
)

fig.show()

Heart Disease vs. History of Smoking¶

This cell builds a histogram relating smoking history (Smoking_History) to the presence or absence of heart disease (Heart_Disease) using Plotly Express. Each bar on the x-axis represents a smoking-history group, and the bars are colour-coded to distinguish people with and without heart disease. In both variables, 0 denotes absence (no smoking history, no heart disease) and 1 denotes presence. The resulting plot, "Smoking History vs. Heart Disease", displayed with fig.show(), gives a concise summary of the distribution of heart disease cases across smoking-history categories.

Output:

The plot suggests that people with heart disease are more likely to have a smoking history, while most people without a smoking history in this dataset do not have heart disease.

In [16]:
import plotly.express as px

fig = px.histogram(
    df,
    x='Smoking_History',
    color='Heart_Disease',
    labels={'Smoking_History': 'Smoking History', 'Heart_Disease': 'Heart Disease'},
    title='Smoking History vs. Heart Disease',
)

fig.update_layout(
    xaxis_title='Smoking History',
    yaxis_title='Count',
)

fig.show()

Skin Cancer prevalence by Gender and Age Group¶

Using groupby(), the mean prevalence of skin cancer is computed for each combination of age category and gender. Grouping on 'Age_Category' and 'Sex', taking the mean of 'Skin_Cancer', and resetting the index produces a new DataFrame named 'skin_cancer_prevalence'. A bar chart is then created with seaborn on a 12-by-6 figure: the x-axis shows the age categories, the y-axis the mean skin cancer prevalence, and the hue splits the bars by gender. The plot is titled "Skin Cancer Prevalence by Age Category and Gender", with correspondingly labelled axes.

Output:

The output chart highlights differences between male (encoded as 1, displayed in orange) and female (encoded as 0, shown in blue) populations by visualizing the occurrence of skin cancer across age groups. The chart is important because it makes gender differences in skin cancer prevalence visible at each age group.

In [17]:
skin_cancer_prevalence = df.groupby(['Age_Category', 'Sex'])['Skin_Cancer'].mean().reset_index()

# Creating a bar chart to visualize skin cancer prevalence
plt.figure(figsize=(12, 6))
sns.barplot(data=skin_cancer_prevalence, x='Age_Category', y='Skin_Cancer', hue='Sex')
plt.title('Skin Cancer Prevalence by Age Category and Gender')
plt.xlabel('Age Category')
plt.ylabel('Prevalence Rate')
plt.show()

Feature Selection¶

The feature-selection step prepares the data for heart-disease prediction. The 'Heart_Disease' column becomes the target variable y, and the remaining columns form the feature matrix X. train_test_split(X, y, test_size=0.2, random_state=42) then splits the data into training and testing sets, holding out 20% for testing with a fixed seed for reproducibility. A RandomForestClassifier with 100 decision trees is fitted to the training data, and its feature_importances_ attribute scores the significance of each feature. SelectFromModel with a 'median' threshold keeps the features whose importance is at or above the median, producing the reduced matrices X_train_selected and X_test_selected. Finally, sfm.get_support(indices=True) returns the indices of the retained features, which are used to extract and print their names. This methodology combines data preparation, model training, and feature selection to support predictive accuracy in classifying heart disease.

In [18]:
# Splitting the data into feature matrix (X) and target variable (y)
X = df.drop(columns=['Heart_Disease'])
y = df['Heart_Disease']

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Training a random forest classifier model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Performing feature selection
feature_importances = model.feature_importances_

# Creating a SelectFromModel object to select features based on a threshold
sfm = SelectFromModel(model, threshold='median')
sfm.fit(X_train, y_train)
X_train_selected = sfm.transform(X_train)
X_test_selected = sfm.transform(X_test)

# Now, X_train_selected and X_test_selected contain the selected features

# Printing the selected features
selected_feature_indices = sfm.get_support(indices=True)
selected_features = X.columns[selected_feature_indices]
print("Selected features:", selected_features)
Selected features: Index(['General_Health', 'Age_Category', 'Height_(cm)', 'Weight_(kg)', 'BMI',
       'Alcohol_Consumption', 'Fruit_Consumption',
       'Green_Vegetables_Consumption', 'FriedPotato_Consumption', 'Age'],
      dtype='object')
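Note that `X_train_selected` and `X_test_selected` are produced here, but the models in the next section are fitted on the full `X_train`. If one wanted to evaluate on the reduced feature set instead, the pattern would be (a sketch on synthetic data, not the notebook's DataFrame):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the notebook's X and y
X, y = make_classification(n_samples=400, n_features=10, n_informative=4,
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the selector as the notebook does: a forest plus a median threshold
sfm = SelectFromModel(RandomForestClassifier(n_estimators=50, random_state=42),
                      threshold='median')
sfm.fit(X_tr, y_tr)
X_tr_sel, X_te_sel = sfm.transform(X_tr), sfm.transform(X_te)

# Refit on the reduced matrix and score on the matching test slice
reduced = RandomForestClassifier(n_estimators=50, random_state=42)
reduced.fit(X_tr_sel, y_tr)
print(X_tr_sel.shape[1], 'features kept; test accuracy:',
      reduced.score(X_te_sel, y_te))
```

The key point is that the same fitted selector must transform both the training and the testing matrices so the model sees identical columns at fit and predict time.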

Model Training and Evaluation¶

Random Forest¶

Using 100 decision trees, the Random Forest Classifier predicts the presence of heart disease with a noteworthy accuracy of 91.7% on the testing dataset. The assessment includes a confusion matrix and a classification report to give a thorough picture of the model's behaviour. Of the 55,218 test instances, the confusion matrix shows 50,498 true negatives (correctly predicted non-heart-disease cases), 189 false positives (non-heart-disease cases misclassified as heart disease), 4,397 false negatives (heart-disease cases misclassified as non-heart-disease), and 134 true positives (correctly predicted heart-disease cases).

The classification report breaks down precision, recall, and F1-score for both classes. Precision is 92% for the absence of heart disease but only 41% for its presence. Recall is near-perfect (100%) for the negative class, yet just 3% for heart disease, meaning the model misses almost all positive cases. The F1-scores show the same gap: 96% for non-heart disease versus 6% for heart disease. Although the overall accuracy is 92%, the severe shortfall on the positive class calls for further work, particularly on reducing false negatives and improving the model's sensitivity to heart-disease cases.

In [19]:
# 1: Random Forest Classifier
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

rf_y_pred = rf_model.predict(X_test)
rf_accuracy = accuracy_score(y_test, rf_y_pred)
rf_conf_matrix = confusion_matrix(y_test, rf_y_pred)
rf_class_report = classification_report(y_test, rf_y_pred)

# Printing the evaluation results
print("Random Forest Classifier:")
print("Accuracy:", rf_accuracy)
print("Confusion Matrix:\n", rf_conf_matrix)
print("Classification Report:\n", rf_class_report)
Random Forest Classifier:
Accuracy: 0.9169473722336919
Confusion Matrix:
 [[50498   189]
 [ 4397   134]]
Classification Report:
               precision    recall  f1-score   support

           0       0.92      1.00      0.96     50687
           1       0.41      0.03      0.06      4531

    accuracy                           0.92     55218
   macro avg       0.67      0.51      0.51     55218
weighted avg       0.88      0.92      0.88     55218
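The 3% recall on the positive class reflects the roughly 11:1 class imbalance rather than a coding error. One common remedy, not applied in this notebook, is to reweight the classes with `class_weight='balanced'`. A sketch on synthetic imbalanced data (the variable names and 9:1 split are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic data with a roughly 9:1 class imbalance, echoing the dataset
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          stratify=y, random_state=42)

plain = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
weighted = RandomForestClassifier(class_weight='balanced',
                                  random_state=42).fit(X_tr, y_tr)

# Compare minority-class recall with and without reweighting
r_plain = recall_score(y_te, plain.predict(X_te))
r_weighted = recall_score(y_te, weighted.predict(X_te))
print('recall (plain):   ', r_plain)
print('recall (weighted):', r_weighted)
```

Reweighting does not always help on its own; resampling (over- or undersampling) and tuning the decision threshold are other standard options worth comparing.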

Support Vector Machine¶

Next, a Support Vector Machine (SVM) model is trained and used for prediction on the testing dataset. Its predictions are compared against the true target values, and the assessment covers accuracy, a confusion matrix, and a full classification report. Together these metrics give a clear picture of how well the SVM classifies the data, and they inform whether further refinement or an alternative model should be considered.

Output:

The SVM model achieves 91.8% accuracy. The confusion matrix, however, reveals a stark imbalance: the model correctly detects cases without heart disease (class 0) but fails to identify a single case of heart disease (class 1). As a result, class 1's precision, recall, and F1-score are all zero, and the high overall accuracy is driven entirely by the SVM's ability to predict class 0 instances. The macro-average metrics, with precision, recall, and F1-score all hovering around 0.5, underscore the difficulty caused by class imbalance. These results call for a thorough assessment of possible improvements or alternative models to address the model's flaws.

In [20]:
# 2: Support Vector Machine (SVM)
svm_model = SVC()
svm_model.fit(X_train, y_train)

svm_y_pred = svm_model.predict(X_test)
svm_accuracy = accuracy_score(y_test, svm_y_pred)
svm_conf_matrix = confusion_matrix(y_test, svm_y_pred)
svm_class_report = classification_report(y_test, svm_y_pred)


print("Support Vector Machine (SVM):")
print("Accuracy:", svm_accuracy)
print("Confusion Matrix:\n", svm_conf_matrix)
print("Classification Report:\n", svm_class_report)
Support Vector Machine (SVM):
Accuracy: 0.917943424245717
Confusion Matrix:
 [[50687     0]
 [ 4531     0]]
Classification Report:
               precision    recall  f1-score   support

           0       0.92      1.00      0.96     50687
           1       0.00      0.00      0.00      4531

    accuracy                           0.92     55218
   macro avg       0.46      0.50      0.48     55218
weighted avg       0.84      0.92      0.88     55218

/Users/ishratshaikh/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1318: UndefinedMetricWarning:

Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
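The `UndefinedMetricWarning` above arises because the SVM predicts no samples for class 1, so precision for that class involves division by zero. As the warning itself suggests, the `zero_division` parameter of `classification_report` controls this behavior. A minimal sketch on toy labels (not the BRFSS data):

```python
# Minimal sketch: reproduce the "no predicted positives" situation on
# toy labels and silence the warning with zero_division.
import numpy as np
from sklearn.metrics import classification_report

y_true = np.array([0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0])  # model never predicts class 1, as with the SVM

# zero_division=0 sets the ill-defined precision/F-score to 0.0
# explicitly instead of emitting UndefinedMetricWarning
report = classification_report(y_true, y_pred, zero_division=0)
print(report)
```

Setting `zero_division` only suppresses the warning; the underlying issue (the model never predicting the positive class) still needs to be addressed.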

Logistic Regression¶

This code trains and evaluates a Logistic Regression model for predicting heart disease. After the model is fit on the training dataset, its predictive accuracy is measured on held-out test data. The evaluation examines the confusion matrix, which tallies correct and incorrect predictions, alongside the model's overall accuracy. In addition, a classification report is produced, summarizing precision, recall, and F1-score for each class and giving insight into how accurately and effectively the Logistic Regression model predicts heart-disease cases.

Output:

Logistic Regression achieves 91.7% accuracy on the dataset. The model identifies cases without heart disease (class 0) well but struggles with heart-disease cases (class 1), producing many false negatives. The classification report shows 41% precision and only 2% recall for heart disease, underscoring the model's difficulty in correctly detecting positive cases. The 92% overall accuracy mainly reflects how well the model predicts patients free of heart disease, while the macro-average metrics again expose the effect of class imbalance, calling for a thorough analysis and possible improvements to address the model's limitations.

In [21]:
# 3: Logistic Regression
lr_model = LogisticRegression()
lr_model.fit(X_train, y_train)


lr_y_pred = lr_model.predict(X_test)
lr_accuracy = accuracy_score(y_test, lr_y_pred)
lr_conf_matrix = confusion_matrix(y_test, lr_y_pred)
lr_class_report = classification_report(y_test, lr_y_pred)


print("Logistic Regression:")
print("Accuracy:", lr_accuracy)
print("Confusion Matrix:\n", lr_conf_matrix)
print("Classification Report:\n", lr_class_report)
Logistic Regression:
Accuracy: 0.9173095729653374
Confusion Matrix:
 [[50569   118]
 [ 4448    83]]
Classification Report:
               precision    recall  f1-score   support

           0       0.92      1.00      0.96     50687
           1       0.41      0.02      0.04      4531

    accuracy                           0.92     55218
   macro avg       0.67      0.51      0.50     55218
weighted avg       0.88      0.92      0.88     55218

/Users/ishratshaikh/opt/anaconda3/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning:

lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
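The `ConvergenceWarning` above indicates that the lbfgs solver hit its iteration limit before converging. Following the warning's own advice, one plausible fix (sketched here on synthetic data, not the BRFSS set) is to standardize the features and raise `max_iter`:

```python
# Hedged sketch on synthetic data: address the lbfgs convergence warning
# by scaling features and increasing max_iter, as the warning suggests.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=2000, n_features=10, random_state=42)

# StandardScaler puts all features on comparable scales, which
# typically lets lbfgs converge in far fewer iterations
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))
```

Wrapping the scaler and the classifier in a single pipeline also guarantees that the same scaling learned on the training data is applied at prediction time, avoiding data leakage.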